{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "%matplotlib inline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n# NLP From Scratch: Classifying Names with a Character-Level RNN\n**Author**: [Sean Robertson](https://github.com/spro/practical-pytorch)\n\nWe will be building and training a basic character-level RNN to classify\nwords. This tutorial, along with the following two, show how to do\npreprocess data for NLP modeling \"from scratch\", in particular not using\nmany of the convenience functions of `torchtext`, so you can see how\npreprocessing for NLP modeling works at a low level.\n\nA character-level RNN reads words as a series of characters -\noutputting a prediction and \"hidden state\" at each step, feeding its\nprevious hidden state into each next step. We take the final prediction\nto be the output, i.e. which class the word belongs to.\n\nSpecifically, we'll train on a few thousand surnames from 18 languages\nof origin, and predict which language a name is from based on the\nspelling:\n\n::\n\n    $ python predict.py Hinton\n    (-0.47) Scottish\n    (-1.52) English\n    (-3.57) Irish\n\n    $ python predict.py Schmidhuber\n    (-0.19) German\n    (-2.48) Czech\n    (-2.68) Dutch\n\n\n**Recommended Reading:**\n\nI assume you have at least installed PyTorch, know Python, and\nunderstand Tensors:\n\n-  https://pytorch.org/ For installation instructions\n-  :doc:`/beginner/deep_learning_60min_blitz` to get started with PyTorch in general\n-  :doc:`/beginner/pytorch_with_examples` for a wide and deep overview\n-  :doc:`/beginner/former_torchies_tutorial` if you are former Lua Torch user\n\nIt would also be useful to know about RNNs and how they work:\n\n-  [The Unreasonable Effectiveness of Recurrent Neural\n   Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/)_\n   shows a bunch of real life examples\n-  [Understanding LSTM\n   Networks](https://colah.github.io/posts/2015-08-Understanding-LSTMs/)_\n   is about LSTMs specifically but also informative about RNNs in\n   general\n\n## Preparing the Data\n\n.. Note::\n   Download the data from\n   [here](https://download.pytorch.org/tutorial/data.zip)\n   and extract it to the current directory.\n\nIncluded in the ``data/names`` directory are 18 text files named as\n\"[Language].txt\". Each file contains a bunch of names, one name per\nline, mostly romanized (but we still need to convert from Unicode to\nASCII).\n\nWe'll end up with a dictionary of lists of names per language,\n``{language: [names ...]}``. The generic variables \"category\" and \"line\"\n(for language and name in our case) are used for later extensibility.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from __future__ import unicode_literals, print_function, division\nfrom io import open\nimport glob\nimport os\n\ndef findFiles(path): return glob.glob(path)\n\nprint(findFiles('data/names/*.txt'))\n\nimport unicodedata\nimport string\n\nall_letters = string.ascii_letters + \" .,;'\"\nn_letters = len(all_letters)\n\n# Turn a Unicode string to plain ASCII, thanks to https://stackoverflow.com/a/518232/2809427\ndef unicodeToAscii(s):\n    return ''.join(\n        c for c in unicodedata.normalize('NFD', s)\n        if unicodedata.category(c) != 'Mn'\n        and c in all_letters\n    )\n\nprint(unicodeToAscii('\u015alus\u00e0rski'))\n\n# Build the category_lines dictionary, a list of names per language\ncategory_lines = {}\nall_categories = []\n\n# Read a file and split into lines\ndef readLines(filename):\n    lines = open(filename, encoding='utf-8').read().strip().split('\\n')\n    return [unicodeToAscii(line) for line in lines]\n\nfor filename in findFiles('data/names/*.txt'):\n    category = os.path.splitext(os.path.basename(filename))[0]\n    all_categories.append(category)\n    lines = readLines(filename)\n    category_lines[category] = lines\n\nn_categories = len(all_categories)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now we have ``category_lines``, a dictionary mapping each category\n(language) to a list of lines (names). We also kept track of\n``all_categories`` (just a list of languages) and ``n_categories`` for\nlater reference.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "print(category_lines['Italian'][:5])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Turning Names into Tensors\n\nNow that we have all the names organized, we need to turn them into\nTensors to make any use of them.\n\nTo represent a single letter, we use a \"one-hot vector\" of size\n``<1 x n_letters>``. A one-hot vector is filled with 0s except for a 1\nat index of the current letter, e.g. ``\"b\" = <0 1 0 0 0 ...>``.\n\nTo make a word we join a bunch of those into a 2D matrix\n``<line_length x 1 x n_letters>``.\n\nThat extra 1 dimension is because PyTorch assumes everything is in\nbatches - we're just using a batch size of 1 here.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import torch\n\n# Find letter index from all_letters, e.g. \"a\" = 0\ndef letterToIndex(letter):\n    return all_letters.find(letter)\n\n# Just for demonstration, turn a letter into a <1 x n_letters> Tensor\ndef letterToTensor(letter):\n    tensor = torch.zeros(1, n_letters)\n    tensor[0][letterToIndex(letter)] = 1\n    return tensor\n\n# Turn a line into a <line_length x 1 x n_letters>,\n# or an array of one-hot letter vectors\ndef lineToTensor(line):\n    tensor = torch.zeros(len(line), 1, n_letters)\n    for li, letter in enumerate(line):\n        tensor[li][0][letterToIndex(letter)] = 1\n    return tensor\n\nprint(letterToTensor('J'))\n\nprint(lineToTensor('Jones').size())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Creating the Network\n\nBefore autograd, creating a recurrent neural network in Torch involved\ncloning the parameters of a layer over several timesteps. The layers\nheld hidden state and gradients which are now entirely handled by the\ngraph itself. This means you can implement a RNN in a very \"pure\" way,\nas regular feed-forward layers.\n\nThis RNN module (mostly copied from [the PyTorch for Torch users\ntutorial](https://pytorch.org/tutorials/beginner/former_torchies/\nnn_tutorial.html#example-2-recurrent-net)_)\nis just 2 linear layers which operate on an input and hidden state, with\na LogSoftmax layer after the output.\n\n.. figure:: https://i.imgur.com/Z2xbySO.png\n   :alt:\n\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import torch.nn as nn\n\nclass RNN(nn.Module):\n    def __init__(self, input_size, hidden_size, output_size):\n        super(RNN, self).__init__()\n\n        self.hidden_size = hidden_size\n\n        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)\n        self.i2o = nn.Linear(input_size + hidden_size, output_size)\n        self.softmax = nn.LogSoftmax(dim=1)\n\n    def forward(self, input, hidden):\n        combined = torch.cat((input, hidden), 1)\n        hidden = self.i2h(combined)\n        output = self.i2o(combined)\n        output = self.softmax(output)\n        return output, hidden\n\n    def initHidden(self):\n        return torch.zeros(1, self.hidden_size)\n\nn_hidden = 128\nrnn = RNN(n_letters, n_hidden, n_categories)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To run a step of this network we need to pass an input (in our case, the\nTensor for the current letter) and a previous hidden state (which we\ninitialize as zeros at first). We'll get back the output (probability of\neach language) and a next hidden state (which we keep for the next\nstep).\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "input = letterToTensor('A')\nhidden = torch.zeros(1, n_hidden)\n\noutput, next_hidden = rnn(input, hidden)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "For the sake of efficiency we don't want to be creating a new Tensor for\nevery step, so we will use ``lineToTensor`` instead of\n``letterToTensor`` and use slices. This could be further optimized by\npre-computing batches of Tensors.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "input = lineToTensor('Albert')\nhidden = torch.zeros(1, n_hidden)\n\noutput, next_hidden = rnn(input[0], hidden)\nprint(output)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As you can see the output is a ``<1 x n_categories>`` Tensor, where\nevery item is the likelihood of that category (higher is more likely).\n\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Training\nPreparing for Training\n----------------------\n\nBefore going into training we should make a few helper functions. The\nfirst is to interpret the output of the network, which we know to be a\nlikelihood of each category. We can use ``Tensor.topk`` to get the index\nof the greatest value:\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def categoryFromOutput(output):\n    top_n, top_i = output.topk(1)\n    category_i = top_i[0].item()\n    return all_categories[category_i], category_i\n\nprint(categoryFromOutput(output))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We will also want a quick way to get a training example (a name and its\nlanguage):\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import random\n\ndef randomChoice(l):\n    return l[random.randint(0, len(l) - 1)]\n\ndef randomTrainingExample():\n    category = randomChoice(all_categories)\n    line = randomChoice(category_lines[category])\n    category_tensor = torch.tensor([all_categories.index(category)], dtype=torch.long)\n    line_tensor = lineToTensor(line)\n    return category, line, category_tensor, line_tensor\n\nfor i in range(10):\n    category, line, category_tensor, line_tensor = randomTrainingExample()\n    print('category =', category, '/ line =', line)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Training the Network\n\nNow all it takes to train this network is show it a bunch of examples,\nhave it make guesses, and tell it if it's wrong.\n\nFor the loss function ``nn.NLLLoss`` is appropriate, since the last\nlayer of the RNN is ``nn.LogSoftmax``.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "criterion = nn.NLLLoss()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Each loop of training will:\n\n-  Create input and target tensors\n-  Create a zeroed initial hidden state\n-  Read each letter in and\n\n   -  Keep hidden state for next letter\n\n-  Compare final output to target\n-  Back-propagate\n-  Return the output and loss\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "learning_rate = 0.005 # If you set this too high, it might explode. If too low, it might not learn\n\ndef train(category_tensor, line_tensor):\n    hidden = rnn.initHidden()\n\n    rnn.zero_grad()\n\n    for i in range(line_tensor.size()[0]):\n        output, hidden = rnn(line_tensor[i], hidden)\n\n    loss = criterion(output, category_tensor)\n    loss.backward()\n\n    # Add parameters' gradients to their values, multiplied by learning rate\n    for p in rnn.parameters():\n        p.data.add_(p.grad.data, alpha=-learning_rate)\n\n    return output, loss.item()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now we just have to run that with a bunch of examples. Since the\n``train`` function returns both the output and loss we can print its\nguesses and also keep track of loss for plotting. Since there are 1000s\nof examples we print only every ``print_every`` examples, and take an\naverage of the loss.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import time\nimport math\n\nn_iters = 100000\nprint_every = 5000\nplot_every = 1000\n\n\n\n# Keep track of losses for plotting\ncurrent_loss = 0\nall_losses = []\n\ndef timeSince(since):\n    now = time.time()\n    s = now - since\n    m = math.floor(s / 60)\n    s -= m * 60\n    return '%dm %ds' % (m, s)\n\nstart = time.time()\n\nfor iter in range(1, n_iters + 1):\n    category, line, category_tensor, line_tensor = randomTrainingExample()\n    output, loss = train(category_tensor, line_tensor)\n    current_loss += loss\n\n    # Print iter number, loss, name and guess\n    if iter % print_every == 0:\n        guess, guess_i = categoryFromOutput(output)\n        correct = '\u2713' if guess == category else '\u2717 (%s)' % category\n        print('%d %d%% (%s) %.4f %s / %s %s' % (iter, iter / n_iters * 100, timeSince(start), loss, line, guess, correct))\n\n    # Add current loss avg to list of losses\n    if iter % plot_every == 0:\n        all_losses.append(current_loss / plot_every)\n        current_loss = 0"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Plotting the Results\n\nPlotting the historical loss from ``all_losses`` shows the network\nlearning:\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import matplotlib.pyplot as plt\nimport matplotlib.ticker as ticker\n\nplt.figure()\nplt.plot(all_losses)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Evaluating the Results\n\nTo see how well the network performs on different categories, we will\ncreate a confusion matrix, indicating for every actual language (rows)\nwhich language the network guesses (columns). To calculate the confusion\nmatrix a bunch of samples are run through the network with\n``evaluate()``, which is the same as ``train()`` minus the backprop.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Keep track of correct guesses in a confusion matrix\nconfusion = torch.zeros(n_categories, n_categories)\nn_confusion = 10000\n\n# Just return an output given a line\ndef evaluate(line_tensor):\n    hidden = rnn.initHidden()\n\n    for i in range(line_tensor.size()[0]):\n        output, hidden = rnn(line_tensor[i], hidden)\n\n    return output\n\n# Go through a bunch of examples and record which are correctly guessed\nfor i in range(n_confusion):\n    category, line, category_tensor, line_tensor = randomTrainingExample()\n    output = evaluate(line_tensor)\n    guess, guess_i = categoryFromOutput(output)\n    category_i = all_categories.index(category)\n    confusion[category_i][guess_i] += 1\n\n# Normalize by dividing every row by its sum\nfor i in range(n_categories):\n    confusion[i] = confusion[i] / confusion[i].sum()\n\n# Set up plot\nfig = plt.figure()\nax = fig.add_subplot(111)\ncax = ax.matshow(confusion.numpy())\nfig.colorbar(cax)\n\n# Set up axes\nax.set_xticklabels([''] + all_categories, rotation=90)\nax.set_yticklabels([''] + all_categories)\n\n# Force label at every tick\nax.xaxis.set_major_locator(ticker.MultipleLocator(1))\nax.yaxis.set_major_locator(ticker.MultipleLocator(1))\n\n# sphinx_gallery_thumbnail_number = 2\nplt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "You can pick out bright spots off the main axis that show which\nlanguages it guesses incorrectly, e.g. Chinese for Korean, and Spanish\nfor Italian. It seems to do very well with Greek, and very poorly with\nEnglish (perhaps because of overlap with other languages).\n\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Running on User Input\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def predict(input_line, n_predictions=3):\n    print('\\n> %s' % input_line)\n    with torch.no_grad():\n        output = evaluate(lineToTensor(input_line))\n\n        # Get top N categories\n        topv, topi = output.topk(n_predictions, 1, True)\n        predictions = []\n\n        for i in range(n_predictions):\n            value = topv[0][i].item()\n            category_index = topi[0][i].item()\n            print('(%.2f) %s' % (value, all_categories[category_index]))\n            predictions.append([value, all_categories[category_index]])\n\npredict('Dovesky')\npredict('Jackson')\npredict('Satoshi')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The final versions of the scripts [in the Practical PyTorch\nrepo](https://github.com/spro/practical-pytorch/tree/master/char-rnn-classification)_\nsplit the above code into a few files:\n\n-  ``data.py`` (loads files)\n-  ``model.py`` (defines the RNN)\n-  ``train.py`` (runs training)\n-  ``predict.py`` (runs ``predict()`` with command line arguments)\n-  ``server.py`` (serve prediction as a JSON API with bottle.py)\n\nRun ``train.py`` to train and save the network.\n\nRun ``predict.py`` with a name to view predictions:\n\n::\n\n    $ python predict.py Hazaki\n    (-0.42) Japanese\n    (-1.39) Polish\n    (-3.51) Czech\n\nRun ``server.py`` and visit http://localhost:5533/Yourname to get JSON\noutput of predictions.\n\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Exercises\n\n-  Try with a different dataset of line -> category, for example:\n\n   -  Any word -> language\n   -  First name -> gender\n   -  Character name -> writer\n   -  Page title -> blog or subreddit\n\n-  Get better results with a bigger and/or better shaped network\n\n   -  Add more linear layers\n   -  Try the ``nn.LSTM`` and ``nn.GRU`` layers\n   -  Combine multiple of these RNNs as a higher level network\n\n\n"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.10.4"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}